CoVarNet is a computational framework aiming to unravel the coordination among multiple cell types by analyzing the covariance in the frequencies of cell types across various samples.

CoVarNet has generalizability to be applied to external datasets to recover cellular modules. This tutorial shows how to use CoVarNet method to recover co-occurring cellular modules in scRNA-seq data and spatial transcriptomics data. Notably, the recovery of cellular modules requires CM composition file as a reference, derived as described in Tutorial 1. We provide an example reference file data/network.rds, which is defined from the comprehensive pan-tissue cell atlas. Also, users can establish a reference file on user-defined CMs from their own data.

Loading packages

library(CoVarNet)
The R packages listed below are required for running CoVarNet. The version numbers indicate the package versions used for testing the CoVarNet code. Other R versions might work too.
  • R (v4.1.2).
  • R packages: dplyr(v1.1.4), NMF(v0.30.1), Seurat(v5.1.0), cluster(v2.1.6), sp(2.1-4), spdep(v1.3-5), igraph(v1.6.0), circlize(v0.4.15), ComplexHeatmap (v2.15.4), ggsci(v3.0.3), grid(v4.1.2), psych(v2.4.3), RColorBrewer(v1.1-3), ggplot2(v3.5.0), viridis(v0.6.5), tidytext(v0.4.1), dendextend(v1.17.1).

  • These packages, together with the other resources pre-stored in the CoVarNet folder, allow users to:
  • Perform the discovery of cellular modules in scRNA-seq data (Tutorial 1).
  • Perform the recovery of cellular modules in scRNA-seq data and spatial transcriptomics data (Tutorials 2).
  • Recovery of cellular modules in scRNA_seq data

    Before the recovery, users should initially annotate the broad cell types and cell subsets of each cell, and then prepare the cell type annotation file. This file should contain at least three columns: “sampleID”, “majorCluster” which contains the broad cell types, “subCluster” which contains the cell subsets. Here we use a subset (100 randomly selected samples) of the pan-tissue cell atlas as an example, available in data/sc_recover_example.rds. The reference file is available in data/network.rds.

    meta<-readRDS("./data/example_recovery.rds")
    meta[1:5,1:3]
    ##                                 sampleID majorCluster  subCluster
    ## D01-TGGGCGTCATCGGACC-1-marrow BoneMarrow            B B01_Bn_IGHM
    ## D01-TCAGCTCTCCTTGGTC-1-marrow BoneMarrow            B B01_Bn_IGHM
    ## D01-TGCGTGGCAGATAATG-1-marrow BoneMarrow            B B01_Bn_IGHM
    ## D01-TAGCCGGGTGCAGACA-1-marrow BoneMarrow            B B01_Bn_IGHM
    ## D01-CCGTGGATCGTCCGTT-1-marrow BoneMarrow            B B01_Bn_IGHM

    The annotation of cell subsets should be consistent with the reference file.

    ref=readRDS("data/network.rds")
    identical(sort(unique(meta$subCluster)),sort(rownames(basis(ref$raw))))
    ## [1] TRUE

    Cell subset frequency matrix

    In accordance with Tutorial 1, filter the cell annotation file and obtain the normalized frequency matrix.

    meta <- meta[meta$cellSort %in% c("Mix", "Total"), ]  
    rt_sp <- names(table(meta$sampleID))[table(meta$sampleID) >= 100]
    meta <- meta[meta$sampleID %in% rt_sp, ]
    mat_fq_raw<-freq_calculate(meta)
    ## `mutate_if()` ignored the following grouping variables:
    ## • Column `majorCluster`
    mat_fq_norm<-freq_normalize(mat_fq_raw,normalize="minmax")

    NMF recovery

    To recover CMs in each sample, we use the feature matrix (cell subset×CM) predefined in Tutorial 1 to predict the coefficient matrix (CM×sample), employing the NMFpredict function from Luca et al. Cell.
    For standard Nonnegative Matrix Factorization models V≈WH, When V and W are known, the calculation of H can be carried out through the following steps:
  • Initialize H as a random nonnegative matrix.
  • Iteratively update H until convergence is achieved or the maximum number of iterations is reached.
  • The error estimation is measured by the Kullback-Leibler divergence.
  • The command line is:

    mat_cm<-sc_cm_recover(ref,mat_fq_norm)

    The output of this command is H: the activity of CMs in each sample.

    mat_cm[1:12,1:3]
    ##               145 621B-BoneMarrow 640C-LymphNode
    ## CM01 2.302459e-02    2.220446e-16   6.816797e-02
    ## CM02 2.220446e-16    1.472005e+00   9.169505e-02
    ## CM03 2.220446e-16    2.220446e-16   3.602602e-01
    ## CM04 2.220446e-16    3.173015e-01   2.502810e-01
    ## CM05 2.220446e-16    1.455042e+00   6.269249e+00
    ## CM06 2.220446e-16    5.516095e-01   1.318212e-01
    ## CM07 1.160867e+00    2.220446e-16   2.220446e-16
    ## CM08 2.220446e-16    2.220446e-16   2.220446e-16
    ## CM09 2.220446e-16    1.494245e+00   5.764557e-01
    ## CM10 2.220446e-16    2.220446e-16   2.220446e-16
    ## CM11 2.220446e-16    1.525283e-01   6.039352e-02
    ## CM12 2.602212e-02    2.220446e-16   2.220446e-16

    Categorize all samples into exclusive CM types.

    max_cm <- apply(mat_cm, 2, function(x) rownames(mat_cm)[which.max(x)])
    max_cm<-gsub("CM", "CMT",max_cm)
    max_cm[1:10]
    ##               145   621B-BoneMarrow    640C-LymphNode       640C-Spleen 
    ##           "CMT07"           "CMT09"           "CMT05"           "CMT05" 
    ##      A16_TH_TOT_3      A16_TH_TOT_5 A16_TH_TOT_5GEX_4      A16_TH_TOT_7 
    ##           "CMT09"           "CMT09"           "CMT09"           "CMT09" 
    ##    A26-TCL-0-SC-1    A32-MLN-0-SC-1 
    ##           "CMT03"           "CMT09"

    Visualizing the distribution of CMs across samples

    The heatmap displays CMs activity profiles across tissues. Each sample is assigned a dominant CM, and all samples are ordered by their CM identities and tissue types.

    gr.distribution(mat_cm,meta=meta,group="tissue") 

    Recovery of cellular modules in spatial transcriptomics data

    We posited that distinct cell subsets within each CM are spatially orchestrated to respond collectively. Thus, we map the CMs to spatially resolved transcriptomics data to investigate the spatial characteristics of CMs.
    Here we use ileum data generated by 10X Visium as an example (presented in Bergenstråhle et al. Nature Biotechnology). Before the recovery, we use cell2location to estimate the cell abundance per cell subset in each spot. Similar to the requirement in the recovery of scRNA-seq data, The annotation of cell subsets should be consistent with the reference file data/network.rds. Four example data after deconvolution are available in data/Ileum_1.rds, data/Ileum_2.rds, data/Ileum_3.rds, data/Ileum_4.rds.
    For users who use their own spatial data to recover cellular modules, other annotation methods might work too.

    spe=readRDS("./data/Ileum_1.rds")
    ref=readRDS("data/network.rds")
    spe@meta.data[,rownames(basis(ref$raw))][1:5,1:3]
    ##                    B01_Bn_IGHM B02_Bn_TCL1A B03_Bn_NR4A2
    ## AAACAAGTATCTCCCA-1 0.007082924   0.03670086  0.004070013
    ## AAACACCAATAACTGC-1 0.008483844   0.04575415  0.005342734
    ## AAACAGAGCGACTCCT-1 0.049516937   0.24134531  0.013165244
    ## AAACAGCTTTCAGAAG-1 0.100642546   0.72028510  0.041941236
    ## AAACAGGGTCTATATT-1 0.019476875   0.11386335  0.012350037

    We can use the same method to predict the activity of CMs in each spot through the function sc_cm_recover. Subsequently, visualize the spatial distribution of cellular modules using Seurat::SpatialFeaturePlot.

    mat_fq_norm<-freq_normalize_st(spe,ref,normalize="minmax")
    ## `mutate_if()` ignored the following grouping variables:
    ## • Column `majorCluster`
    mat_cm<-sc_cm_recover(ref,mat_fq_norm)
    spe=AddMetaData(object = spe,
                      metadata = t(mat_cm),
                      col.name = rownames(mat_cm))
    SpatialFeaturePlot(spe,features="CM03")


    Additionally, we construct CM networks by interconnecting co-occurring nodes through specifically correlated edges in Tutorial 1, and thus obtain the cell subset components of each CM. With this specific cell compositions, we can also recover CMs by calculating the weighted sum of cell subset abundance in each spot.

    spe=readRDS("./data/Ileum_1.rds")
    spe=st_cm_recover(ref$filter,spe)
    ## No id variables; using all as measure variables
    p1<-SpatialFeaturePlot(spe,features="CM03",pt.size.factor = 0)
    p2<-SpatialFeaturePlot(spe,features=c("CM03","B12_Plasma_IGHA2","CD8T03_Trm_ITGA1","S07_Fb_TCF21","CD4T06_Tm_ANXA1","S20_Glia_CDH19"),ncol=3)
    p3<-SpatialFeaturePlot(spe,features=c("CM05","CD8T02_Tem_GZMK","CD4T03_Tn_NR4A1","B03_Bn_NR4A2","B05_Bm_NR4A2","CD4T04_Tfh_IL6ST"),ncol=3)

    H&E staining image of ileum_1

    p1

    p2

    p3


    The cell subsets of CM03 and CM05 exhibited notable colocalization respectively, with CM03 enriched in the intestinal mucosa and CM05 enriched in the Peyer’s patch region.

    The heatmap quantifies spatial proximity among the cell subsets of CM03 and CM05. Here We calculate Global Bivariate Moran’s I (using spdep R package) between cell subsets on a single spatial slide.

    coords<-spe@images$slice1@boundaries$centroids@coords
    subsets=colnames(spe@meta.data)[4:79]
    st_fq<-spe@meta.data[,subsets]
    gr.proximity(coords,st_fq,ref,c("CM03","CM05"))
    ## Registered S3 methods overwritten by 'proxy':
    ##   method               from    
    ##   print.registry_field registry
    ##   print.registry_entry registry


    Evaluating spatial proximity of cellular modules

    Colocalization

    To evaluate the spatial colocalization of CM components in spatial transcriptomics, we calculated Spearman correlation coefficients for each pair of cell-subset components across spots of a single slide. The initial colocalization score for a CM was defined as the median of these correlation coefficients. Additionally, a background score was determined as the median of correlation coefficients for pairs, where one cell subset was part of the CM and the other was not. The final colocalization score was ascertained by subtracting the background score from the initial.

    spe1=readRDS("./data/Ileum_1.rds")
    st_fq<-spe1@meta.data[,subsets]
    col=colocalization(st_fq,ref)
    col
    ##      cm     score
    ## 1  CM01 0.4896552
    ## 2  CM02 0.6950355
    ## 3  CM03 0.6171329
    ## 4  CM04 0.6794326
    ## 5  CM05 0.6936170
    ## 6  CM06 0.7104851
    ## 7  CM07 0.5041096
    ## 8  CM08 0.6666667
    ## 9  CM09 0.8283688
    ## 10 CM10 0.6821918
    ## 11 CM11 0.6114943
    ## 12 CM12 0.5416667

    Calculate the colocalization of CMs in multiple samples. The highest concordance was observed in CMs with a higher proportion of lymphocytes.

    Aggregation

    To assess the regional aggregation of cell-subset components within CMs in spatial transcriptomics, we utilized the global bivariate Moran’s I using the spdep R package. The initial aggregation score for each CM was determined by calculating the median Moran’s I values across all cell-subset component pairs within the CM. A background score was similarly derived from the median Moran’s I values of pairs consisting of one cell subset from the CM and one from outside. The final aggregation score was calculated by subtracting the background score from the initial.

    coords<-spe1@images$slice1@boundaries$centroids@coords
    bvM=bvMoranI(coords,st_fq,ref)
    bvM
    ##      cm     score
    ## 1  CM01 0.5655172
    ## 2  CM02 0.6815603
    ## 3  CM03 0.7019231
    ## 4  CM04 0.7602837
    ## 5  CM05 0.7028369
    ## 6  CM06 0.7527387
    ## 7  CM07 0.5534247
    ## 8  CM08 0.6377152
    ## 9  CM09 0.8007092
    ## 10 CM10 0.7383562
    ## 11 CM11 0.5390805
    ## 12 CM12 0.5406746

    Calculate the aggregation of CMs in multiple samples. Consistent with the previous conclusion, the highest concordance was observed in CMs with a higher proportion of lymphocytes.